coreference relation
LegalCore: A Dataset for Legal Documents Event Coreference Resolution
Wei, Kangda, Shi, Xi, Tong, Jonathan, Reddy, Sai Ramana, Natarajan, Anandhavelu, Jain, Rajiv, Garimella, Aparna, Huang, Ruihong
Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.
KoCoNovel: Annotated Dataset of Character Coreference in Korean Novels
Kim, Kyuhee, Lee, Surin, Lee, Sangah
In this paper, we present KoCoNovel, a novel character coreference dataset derived from Korean literary texts, complete with detailed annotation guidelines. Comprising 178K tokens from 50 modern and contemporary novels, KoCoNovel stands as one of the largest public coreference resolution corpora in Korean, and the first to be based on literary texts. KoCoNovel offers four distinct versions to accommodate a wide range of literary coreference analysis needs. These versions are designed to support perspectives of the omniscient author or readers, and to manage multiple entities as either separate or overlapping, thereby broadening its applicability. One of KoCoNovel's distinctive features is that 24% of all character mentions are single common nouns, lacking possessive markers or articles. This feature is particularly influenced by the nuances of Korean address term culture, which favors the use of terms denoting social relationships and kinship over personal names. In experiments with a BERT-based coreference model, we observe notable performance enhancements with KoCoNovel in character coreference tasks within literary texts, compared to a larger non-literary coreference dataset. Such findings underscore KoCoNovel's potential to significantly enhance coreference resolution models through the integration of Korean cultural and linguistic dynamics.
Towards Evaluation of Cross-document Coreference Resolution Models Using Datasets with Diverse Annotation Schemes
Zhukova, Anastasia, Hamborg, Felix, Gipp, Bela
Established cross-document coreference resolution (CDCR) datasets contain event-centric coreference chains of events and entities with identity relations. These datasets establish strict definitions of the coreference relations across related texts but typically ignore anaphora with vaguer, context-dependent loose coreference relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+, a CDCR dataset with identity coreference relations, and NewsWCL50, a CDCR dataset with a mix of loose context-dependent and strict coreference relations. We propose a phrasing diversity metric (PD) that, unlike previously proposed metrics, accounts for the diversity of full phrases and allows the lexical diversity of CDCR datasets to be evaluated with higher precision. The analysis shows that the coreference chains of NewsWCL50 are more lexically diverse than those of ECB+, but that annotating NewsWCL50 leads to lower inter-coder reliability. We discuss the different tasks that the two CDCR datasets create for CDCR models, i.e., lexical disambiguation and lexical diversity. Finally, to ensure generalizability of CDCR models, we propose a direction for CDCR evaluation that combines CDCR datasets with multiple annotation schemes that focus on various properties of the coreference chains.
Simplification of Patent Claim Sentences for their Paraphrasing and Summarization
Bouayad-Agha, Nadjet (Barcelona Media and Universitat Pompeu Fabra) | Casamayor, Gerard (Barcelona Media and Universitat Pompeu Fabra) | Ferraro, Gabriela (Barcelona Media and Universitat Pompeu Fabra) | Wanner, Leo (ICREA and Universitat Pompeu Fabra)
We present an approach to patent claim simplification which segments claim sentences into clausal discourse units, transforms them into complete sentences, establishes coreference relations and builds a discourse structure between discourse units. The four stages are necessary to allow for the syntactic analysis of otherwise unparsable claim sentences and their regeneration using discourse structure and coreference relations in order to ensure the production of a cohesive and coherent paraphrase/summary.